Chinese Word Segmentation and Unknown Word Extraction by Mining Maximized Substring

Related resources

Chinese Unknown Word Extraction by Mining Maximized Substrings

Identifying out-of-vocabulary (OOV) words is a major difficulty in Chinese word segmentation. We address this issue by applying a very efficient algorithm for extracting maximized substrings (Shen et al., 2013) from large-scale raw text; the extracted substrings form a list of unknown word candidates. We then apply techniques such as Short-term Store and Lexicon-based Voting to reduce the noise in...

Chinese Word Segmentation by Mining Maximized Substrings

A major problem in the field of Chinese word segmentation is the identification of out-of-vocabulary words. We propose a simple yet effective approach for extracting maximized substrings, which provide good estimations of unknown word boundaries. We also develop a new semi-supervised segmentation technique that incorporates retrieved substrings using discriminative learning. The effectiveness of...

Unknown Word Extraction for Chinese Documents

Chinese text has no blanks to mark word boundaries. As a result, identifying words is difficult because of segmentation ambiguities and occurrences of unknown words. Most previous work focuses only on resolving ambiguous segmentation. The problem of unknown word identification is considered more difficult and needs further investigation. Conventionally unknown wor...

Chinese Unknown Word Translation by Subword Re-segmentation

We propose a general approach for translating Chinese unknown words (UNK) for SMT. This approach takes advantage of the properties of Chinese word composition rules, i.e., all Chinese words are formed by sequential characters. According to the proposed approach, the unknown word is re-split into a subword sequence, followed by subword translation with a subword-based translation model. “Subword” ...

Word Boundary Information and Chinese Word Segmentation

Chinese word segmentation can be considered a problem of word boundary recognition. Word boundary information plays a significant role in human language acquisition and automatic segmentation for Natural Language Processing (NLP). Extraction of word boundary information involves cognitive psychology, computational linguistics, and language education. Methods utilizing word boundary informa...

Journal

Journal title: Journal of Natural Language Processing

Year: 2016

ISSN: 1340-7619,2185-8314

DOI: 10.5715/jnlp.23.235